In a stand-alone system, the use of renewable energies, load changes, and
interruptions to transmission lines can cause voltage drops that impair its
reliability. One way to immediately offset a change in the output of a hybrid
renewable energy system is to use energy storage, without needing to turn on
other plants. The proposed system consists of photovoltaic panels, a wind
turbine, and a wallbox unit (responsible for supplying the vehicle's
electrical needs); in addition to being a power source, the batteries also
serve as a storage unit. The new system takes advantage of recent advances in
machine learning, in particular deep learning and convolutional neural
networks. By employing deep Q-learning algorithms, the agent learns from the
data of the various elements of the system to create the optimal policy for
enhancing performance. To increase the learning efficiency, the reward
function is implemented using a Mamdani fuzzy system. Our experimental
results show that the new system with fuzzy reward using deep Q-learning
networks (DQN) keeps the battery and the wallbox unit optimally charged and
less discharged. The results also confirm the economic advantages of the
proposed approach, which performs approximately 25% better. Moreover, it has
dynamic response capabilities and is more efficient than the existing
optimization approach using deep learning without fuzzy logic.

1. INTRODUCTION


In recent years, governments and industries around the globe have recognized the danger of global
warming and are actively seeking alternatives to minimize greenhouse gas emissions from fossil fuels.
Renewable energy is a promising solution for reducing these emissions, and it is well suited to a
range of applications, including off-grid stand-alone systems [1]. Because renewable energy
production fluctuates, remote sites may benefit from storage devices.

Among the renewable energy resources, both solar and wind power are derived, directly or indirectly,
from the weather [2]. Hybrid renewable energy systems (HRES) generate power by combining
different types of renewable energy. An HRES typically uses solar and wind energy alongside storage
units such as batteries to store excess energy from the input sources [3], [4].

The present article discusses an unbundled system consisting of a power production system, a power
consumption system, and a power storage system: photovoltaic (PV) panels and wind turbines belong to
the power production group; variable electric loads and hybrid electric vehicles belong to the power
consumption group; power storage is limited to the battery bank. An agent-based collaborative energy
management system for stand-alone systems is presented. Our objective is to balance power between the
wallbox units while simultaneously covering the demand, in order to increase system reliability.
Reinforcement learning enables the agents to learn the optimal policy. Each agent handles the
continuous state-action space by implementing deep Q-learning networks.

Fuzzy logic approaches to energy management were used in [5], [6], a neural network approach is
described in [7], and a genetic algorithm in [8]. The fuzzy models gave good estimates of the PV and
wind turbine (WT) power output and, as shown in [8], the proposed method was applied successfully to
the analysis of a hybrid system supplying power to a telecommunications relay station. An agent-based
system with a central coordinator was suggested in [9] to respond optimally to emergency power needs.
Several studies have identified the benefits of autonomous multi-agent systems for managing the
buying and selling of power [10]. In [11], an agent-based system is presented for generating,
storing, and scheduling energy.

Currently, many researchers are working on designing and building fuzzy logic controllers (FLC).
The most common design method is based on supervised learning and uses training data. In real-world
applications, however, it is sometimes impossible to obtain training data or to extract an expert's
knowledge, especially when the cost of doing so is very high. Designing the inference rules is an
important part of the FLC design process, and it requires experimental data. Developing the
consequent parts of the fuzzy rules is more challenging than developing their antecedent parts. In
general, the specialized information is not always easy to acquire, because a priori expert
knowledge is required to derive these fuzzy rules. Additionally, a control strategy based on fuzzy
rules that works under some conditions might not work under others.

This is why reinforcement learning is used to increase the quality and performance of the system.
Reinforcement learning is a class of algorithms that allows computer systems to learn from
experience. Currently, Q-learning and deep Q-learning networks (DQN) are among the most popular
reinforcement learning methods [12]. Whatever the environment looks like, these algorithms learn
from experience by trial and error. In order to implement reinforcement learning over a continuous
input/output domain, it is necessary to integrate fuzzy rules into deep Q-learning algorithms.

The Q-learning algorithm has been used in multiple engineering applications. It is used in [13] to
learn online how to control a hybrid electric vehicle (HEV); the simulated fuel economy results of
the combined control strategy are good. In [14], a game-based fuzzy Q-learning technique was employed
to detect intrusions and protect wireless sensor networks; in terms of successful defenses, the
proposed model works better than the Markovian game-theoretic approach. A two-step-ahead Q-learning
algorithm for scheduling the batteries of a wind turbine is presented in [15], while [16] suggests a
three-step-ahead Q-learning algorithm for scheduling the batteries of a solar energy system. To
reduce the power that a solar panel installation draws from the grid, [17] suggested a multi-agent
system using Q-learning.

The problem with Q-learning is scaling. In complex environments, such as a video game, a great
number of states and actions may exist, and storing a value for every state-action pair in a table
quickly becomes impractical. It is here that artificial neural networks come in handy.

Recently, deep learning has emerged as a powerful tool for solving complex problems and is rapidly
becoming the state of the art in many fields, including speech recognition, natural language
processing, and robotics. In [18], such a model was used to route traffic, and the results are
generally fast paths that avoid frequent traffic stops at red lights. Qu et al. [19] propose a deep
reinforcement learning approach that incorporates double Q-learning and demonstrate that it reduces
the overestimation observed over several games, as hypothesized, and achieves much better
performance overall. A radar signal application based on deep Q-learning networks was implemented in
[20], which showed that the resulting algorithm not only reduced overestimations, as hypothesized,
but also improved performance on multiple games. Deep learning has benefited from technological
advances in central processing units (CPU) and graphics processing units (GPU), which provide the
fast calculations needed during training.

In our previous work [21], an intelligent system for the optimization and management of renewable
energy systems was developed using multi-agent technology. In addition to providing quality services
for energy consumers, that work aims to satisfy the energy demand from solar panels and wind
turbines, optimize battery usage, and enhance battery-related performance. In some scenarios,
simulation results revealed that the charger failed to meet the power demand and that deep
discharges were often observed even at a low charge state. Furthermore, a comparison between a
multi-agent system and a fuzzy logic controller is made in [22], which found that the multi-agent
system has a more efficient implementation and a quicker response time than the FLC. In this study,
an improved exploration/exploitation strategy is presented for the agent, depending on how often it
is in a certain state.

The paper is structured as follows: section 2 explains how reinforcement learning was used in this
study. Section 3 examines the hybrid renewable energy system. In section 4, we discuss the systems
that were studied. In section 5, we discuss the results, and we close by reviewing the findings and
providing some perspectives.


2. REINFORCEMENT LEARNING 

Figure 1 illustrates how an agent interacts with its environment through perception and action
according to the standard reinforcement learning model. At each interaction step, the agent receives
information about the current state of the environment and determines which output to produce based
on that information. The agent is informed of the value of the resulting state change by a scalar
reinforcement signal. A good agent should behave so as to increase the value of the reinforcement
signal over time, which can be accomplished by systematic trial and error [23].


Figure 1. Standard model of reinforcement learning (the environment sends state s and reward r to
the agent, which returns action a)


2.1. Q-learning algorithm 

Q-learning, a reinforcement learning algorithm introduced by Watkins in 1989, does not require a
model of its environment and can be implemented online. This makes it ideal for repeated games with
unknown opponents. Trial and error is the basis of Q-learning. At each stage, a controller (or
agent) selects an appropriate action (or output) according to its Q-value. Q-learning is applied to
identify the optimal policy using delayed rewards: we attempt to determine the optimal action value
$Q^*(s,a) = \max_\pi Q^\pi(s,a)$, i.e. the best value achievable by any policy. The algorithm builds
a Q-table that stores the value of each state-action pair. In the standard online Q-learning
process, the Bellman equation describes a one-step value update, which is just the depth-1 expansion
of the definition of Q: $Q(s,a) = r + \gamma \max_{a'} Q(s',a')$. The Q-learning process is
summarized in the following algorithm:


For each state-action pair (s, a), initialize the table entry Q(s, a) to zero and observe the current state s.
Do forever:

---Pick an action a and execute it

---Immediately receive a reward r

---Observe the new state s'

---Update the table entry for $Q(s_t, a_t)$ as in (1):


$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left( r(s_t, a_t) + \gamma \max_{a_{t'}} Q(s_{t'}, a_{t'}) - Q(s_t, a_t) \right)$ (1)

---s = s'


In (1), $Q(s_t, a_t)$ is the value of the state-action pair in the Q-table at time step t; the
learning rate $\alpha$ affects the speed of convergence to the final Q ($Q^*$); $r(s_t, a_t)$ is the
reward for state s and action a at time step t; the discount factor $\gamma$ is a value in the
interval [0, 1] that determines the weight of future rewards; and $a_{t'}$ is the action at time
step t' [24]. The algorithm proceeds in stages: starting from an initial state, it performs a series
of actions and is rewarded when the goal state is reached. By stochastic approximation, the
iteration converges to the true Q-function (i.e., to the optimal Q and π).
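The update rule (1) can be illustrated with a minimal tabular sketch. The four-state chain environment, the function names (`q_update`, `step`, `train`), and the hyperparameter values below are hypothetical and chosen only to keep the example small; they are not taken from the system studied in this paper.

```python
import random
from collections import defaultdict

ACTIONS = ("left", "right")  # hypothetical toy action set
GOAL = 3                     # hypothetical goal state of a 4-state chain


def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one step of update (1) to the table q and return the new entry."""
    best_next = max(q[(next_state, a)] for a in ACTIONS)
    td_error = reward + gamma * best_next - q[(state, action)]
    q[(state, action)] += alpha * td_error
    return q[(state, action)]


def step(state, action):
    """Toy environment: move along the chain, reward 1 on reaching the goal."""
    next_state = min(state + 1, GOAL) if action == "right" else max(state - 1, 0)
    return next_state, (1.0 if next_state == GOAL else 0.0)


def train(episodes=200, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)  # every table entry starts at zero
    for _ in range(episodes):
        state = 0
        while state != GOAL:
            # Explore with probability epsilon, otherwise act greedily.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            next_state, reward = step(state, action)
            q_update(q, state, action, reward, next_state)
            state = next_state
    return q
```

After training, the entry for moving toward the goal from the state next to it is larger than the entry for moving away, which is the behavior the greedy policy then exploits.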
2.2. Deep Q-learning networks algorithm

Deep Q-learning harnesses deep learning through so-called deep Q-networks. A standard feed-forward
neural network is used to calculate the Q-values. A new Q-value can be generated by using the
maximum output of the neural network and past experiences stored in a local memory. Figure 2
compares deep Q-learning and Q-learning.


Figure 2. An analysis of the differences between Q-learning and deep Q-learning (Q-learning: a
Q-table maps each state-action pair to a Q-value; deep Q-learning: a deep neural network maps a
state to the Q-values of actions 1 to n)


One important thing to notice is that deep Q-networks do not rely on traditional supervised learning
methods, because no labeled expected output is available. Reinforcement learning depends on policy
or value functions, which means the target changes continuously with each iteration. For this
reason, the agent uses two neural networks rather than one. One network, called the Q-network,
calculates Q-values in state $S_t$, while the other, called the target network, calculates them in
state $S_{t+1}$. Given the current state $S_t$, the Q-network returns the action values
$Q(S_t, a)$. The target network uses the next state $S_{t+1}$ to calculate $Q(S_{t+1}, a)$ for the
temporal difference target. During training, every N iterations the parameters of the Q-network are
copied to the target network. Figure 3 illustrates the steps involved in the process.


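Figure 3. Target and Q-network (the Q-network outputs $Q(s, a; \theta_i)$; the target is built from
$r + \gamma \max$ over the target network outputs)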


The parameters $\theta_{i-1}$ (weights, biases) of the target network correspond to the parameters
$\theta_i$ of the Q-network at a previous point in time. The parameters of the target network are
thus frozen in time and are periodically updated with the Q-network parameters. Essentially, an
experience replay is just a memory in the form of a series of tuples (s, a, r, s'), as shown in
Figure 4.

In Figure 4, the agent is trained by interacting with its environment and receiving data that is
used to train the Q-network. The agent explores the environment in order to build a complete picture
of the transitions. Initially, the agent makes its decisions randomly, but this becomes insufficient
over time, so the agent then consults both the environment and the Q-network to decide what to do.
This combination of random behavior and Q-network behavior is called the epsilon-greedy method,
because we switch between the random policy and the Q policy using the probability hyperparameter
epsilon.
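The epsilon-greedy switch described above can be sketched as follows. The function names and the exponential decay schedule are assumptions made for illustration; the paper does not specify how epsilon evolves during training.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return a random action index with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])


def decayed_epsilon(step, start=1.0, end=0.05, decay=0.995):
    """Hypothetical schedule: decay epsilon exponentially from start toward end."""
    return max(end, start * decay ** step)
```

With epsilon close to 1 the agent behaves almost randomly (exploration); as epsilon decays, the action with the highest Q-value is selected more and more often (exploitation).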


Figure 4. DQN algorithm concept (blocks: replay buffer memorizing state, action, reward, and
next_state; epsilon-greedy action selection from Q_network_local; Q_target calculated with
Q_network_target)


During learning, the predicted value and the target value are calculated using two distinct
Q-networks (Q_network_local and Q_network_target). The target network weights are frozen for several
time steps and then updated by copying the weights from the actual Q-network. Freezing the target
Q-network for a while and then updating its weights with the actual Q-network weights stabilizes the
training process. We also find that training is more stable when we apply a replay buffer that
stores the agent's experience, from which random samples are then drawn for training.

For both networks, random batches of tuples (s, a, r, s') from the experience replay are used to
calculate Q-values, and then backpropagation is performed. The loss is calculated as the square of
the difference between the target Q-value and the predicted Q-value. In other words, we want to
reduce the distance between the predicted and target Q-values. This distance is expressed by the
squared error loss function, which can be minimized with gradient descent algorithms.


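$L_i(\theta_i) = \mathbb{E}_{a \sim \mu}\left[ \left( y_i - Q(s, a; \theta_i) \right)^2 \right]$ (2)

where

$y_i = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \mid S_t = s, A_t = a \right]$ (3)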
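As a worked illustration of (2) and (3), the sketch below computes the targets and the squared error loss for a small mini-batch. It uses plain Python lists in place of network outputs; the function names and the numbers in the usage example are hypothetical.

```python
def td_targets(rewards, next_q_target, dones, gamma=0.99):
    """Targets as in (3): r for terminal transitions,
    otherwise r + gamma * max over a' of the target-network Q-values."""
    return [
        r if done else r + gamma * max(q_next)
        for r, q_next, done in zip(rewards, next_q_target, dones)
    ]


def squared_error_loss(targets, predicted_q, actions):
    """Loss as in (2): mean of (y - Q(s, a))^2 over the mini-batch."""
    errors = [(y - q[a]) ** 2 for y, q, a in zip(targets, predicted_q, actions)]
    return sum(errors) / len(errors)


# Two transitions: the first is non-terminal, the second is terminal.
y = td_targets([1.0, 0.0], [[0.5, 2.0], [1.0, 0.0]], [False, True], gamma=0.5)
loss = squared_error_loss(y, [[1.0, 0.0], [0.0, 3.0]], [0, 1])
```

Here the first target is 1.0 + 0.5 * 2.0 = 2.0 and the second is simply the reward 0.0, so the loss is ((2 - 1)^2 + (0 - 3)^2) / 2 = 5.0.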


This loss is minimized while the Q-network is being trained; its parameters are transferred to the
target network later. Deep Q-learning is a multi-stage process that involves the following steps:


Initialize replay memory D to its maximum capacity

Initialize the action-value function Q with random weights $\theta$

For episode = 1, M do

    Initialize sequence $s_1 = \{x_1\}$ and preprocessed sequence $\phi_1 = \phi(s_1)$
    For t = 1, T do

        With probability $\epsilon$ pick a random action $a_t$
        Otherwise select $a_t = \arg\max_a Q^*(\phi(s_t), a; \theta)$

        Execute action $a_t$ in the emulator and observe reward $r_t$ and image $x_{t+1}$
        Set $s_{t+1} = s_t, a_t, x_{t+1}$ and preprocess $\phi_{t+1} = \phi(s_{t+1})$

        Store transition $(\phi_t, a_t, r_t, \phi_{t+1})$ in D

        Sample a random mini-batch of transitions $(\phi_j, a_j, r_j, \phi_{j+1})$ from D
        Set $y_j = r_j$ for terminal $\phi_{j+1}$
        Set $y_j = r_j + \gamma \max_{a'} Q(\phi_{j+1}, a'; \theta)$ for non-terminal $\phi_{j+1}$
        Perform a gradient descent step on $(y_j - Q(\phi_j, a_j; \theta))^2$
    End for
End for
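The two memory-related ingredients of the procedure above, the replay memory D and the periodic copy of the Q-network parameters to the target network, can be sketched as below. This is a minimal sketch under stated assumptions: the class and function names are hypothetical, and network weights are represented by a plain list rather than by a real neural network.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity memory of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity, seed=None):
        self.memory = deque(maxlen=capacity)  # oldest transitions are dropped first
        self.rng = random.Random(seed)

    def store(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Draw a random mini-batch without replacement."""
        return self.rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


def maybe_sync_target(step, sync_every, local_weights, target_weights):
    """Copy the local weights into the target weights every sync_every steps."""
    if step % sync_every == 0:
        target_weights[:] = local_weights
        return True
    return False
```

Sampling at random from the buffer breaks the correlation between consecutive transitions, and keeping the target weights fixed between copies keeps the regression target from moving at every gradient step.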


3. HYBRID SYSTEM ADOPTED FOR RENEWABLE ENERGY 

This study examines the use of a stand-alone PV/wind/battery hybrid renewable energy system (HRES)
to supply the electrical needs of a residential home or apartment at an isolated site. Our system
combines the advantages of wind and solar energy to maximize efficiency. During the planning of the
HRES installation at the isolated site, the weather changes and the input and output sources
required to satisfy the load demand were analyzed. To deliver a maximum of 1 kWp, 16 PV panels made
up of 36 cells each are connected in series and combined with a generic wind turbine with a peak
rated power of 1 kW under standard conditions (25 °C and 300 W/m² of irradiance). The system
delivers a maximum power of 2 kW. In addition, the system includes a number of batteries that can
act as either sources or consumers; the other consumers are the wallbox unit and the dynamic load
[5]. Figure 5 shows the HRES used in this study:


Figure 5. Hybrid energy system


4. ENVIRONMENT OF DEEP Q-LEARNING AGENT 
4.1. State space and action 

As described previously, the agent works in a stand-alone environment. To determine the system's
current state, the agent collects data from the system: the power demand of the consumers, the power
produced by the photovoltaic panels, the power produced by the wind turbine, the wallbox unit
(vehicle battery), and the backup battery. The agent's states are defined by the net power (Pnew)
between the generated power (Psys) and the power demand (Pload), the SOC, and the amount of energy
entering the wallbox.

The agent can act on both the battery and the wallbox. As shown in Figure 6, these actions are
charging/discharging the battery and enabling/disabling the wallbox. The agent thus decides whether
to charge or discharge the battery and whether the wallbox will operate. In our study, Q-learning
networks are used to find an optimal mapping between system states and actions, and the use of
actions in different states is explored through an exploration/exploitation strategy. The agent is
responsible for managing the stand-alone system's energy so as to fully recharge the batteries and
to ensure that the wallbox has enough electricity to meet the needs of the unit. Figure 7 gives a
general overview of the architecture of our DQN approach, which comprises an input layer that
represents the state space, three fully connected (inner product) hidden layers, and an output layer
that produces a Q-value for each possible action.

Figure 6. Agent interaction with the environment

Figure 7. An overview of the neural network used in deep Q-learning
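One possible encoding of the state and action spaces described in this subsection is sketched below. The class and field names are hypothetical, and treating Pnew as the difference Psys - Pload is an assumption about how the net power is computed.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class State:
    p_new: float    # net power, assumed here to be Psys - Pload (W)
    soc: float      # battery state of charge, between 0 and 1
    wallbox: float  # amount of energy entering the wallbox

BATTERY_ACTIONS = ("charge", "discharge")
WALLBOX_ACTIONS = ("enable", "disable")
# Joint action space: one battery action combined with one wallbox action.
ACTIONS = tuple(product(BATTERY_ACTIONS, WALLBOX_ACTIONS))


def net_power(p_sys, p_load):
    """Assumed definition of Pnew as generated power minus demand."""
    return p_sys - p_load


def greedy_action(q_values):
    """Map the network output (one Q-value per joint action) to an action pair."""
    idx = max(range(len(ACTIONS)), key=lambda i: q_values[i])
    return ACTIONS[idx]
```

Under this encoding the output layer of the network has four units, one per (battery, wallbox) pair, and `greedy_action` returns the pair with the highest Q-value.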


4.2. Fuzzy reward 

In reinforcement learning, the reward signal comes from the environment and indicates whether the
agent is performing a ‘good’ action. The reward function needs to specify all the constraints that
should be respected and, for improved learning efficiency, to consider the factors affecting
decisions as well as their interrelationships. A more accurate reward function is also a more
complex one, which increases the computation required. Furthermore, multi-criteria decision making
typically includes flexible requirements for optimality. A fuzzy reward based on a fuzzy system is
therefore proposed to strike the right balance between flexible requirements and implementation
complexity [25].

Fuzzy logic models make it possible to describe complex processes in general terms, without using
complicated models. In classical set theory, sets are crisp: an element either belongs to a set or
does not. In fuzzy set theory, an element's degree of membership in a fuzzy set is a number in the
range [0, 1]. Using fuzzy if/then rules, fuzzy logic can express relationships between fuzzy
variables in linguistic terms. Generally, such rules follow a generic format:

R: if (k1 is I1) and/or (k2 is I2) ... and/or (km is Im) then (y is O), where I1, ..., Im are the
fuzzy input sets, k = (k1, k2, ..., km) is the crisp input vector, y is the output variable, and O
is a fuzzy set defined by an expert. Among the various fuzzy implication methods, the Mamdani method
[26] is used in this article to find the value of the total reward because of its computational
efficiency. As illustrated in Figure 8, the procedure can be specified with four blocks. First, the
fuzzifier transforms the input vectors into fuzzy values. The second block is the knowledge base and
data base, where the membership functions are defined and the fuzzy rules are stored. Third, a fuzzy
inference engine makes approximate decisions based on the fuzzy rules. Finally, a defuzzifier
determines a crisp result. The reward is based on three factors: the battery status (SOC), the
amount of power flowing through the wallbox unit, and the load (power demand).

Figure 8. Block diagram of a Mamdani-type FLS (inputs Rsoc, Rvehicle, Rload; blocks: fuzzifier,
inference engine, knowledge base, defuzzifier; output: total reward)


In the resulting state, the charge state of the battery (Rsoc), the vehicle (Rvehicle), and the load
(Rload) must be taken into consideration. The reward factors are calculated as:

$R_{soc} = \frac{SOC - SOC_{min}}{SOC_{max}}$ (4)

$R_{vehicle} = \frac{Vehicle - Vehicle_{min}}{Vehicle_{max}}$ (5)

$R_{load} = \frac{Load - Load_{min}}{Load_{max}}$ (6)

The values Rsoc, Rvehicle, and Rload are scaled by normalizing them within the range [-1, 1], as
determined by the maximum and minimum levels of the SOC, vehicle, and load.
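The three reward factors share the same form, so they can be computed by a single helper, sketched below. The function name is hypothetical, and clipping the result to [-1, 1] is an assumption about how the stated range is enforced.

```python
def reward_factor(value, minimum, maximum):
    """Compute (value - minimum) / maximum as in (4)-(6).

    The result is clipped to [-1, 1]; the clipping step is an assumption.
    """
    raw = (value - minimum) / maximum
    return max(-1.0, min(1.0, raw))


# Example with the SOC limits used later in the paper (25% and 85%).
r_soc = reward_factor(0.55, 0.25, 0.85)  # (0.55 - 0.25) / 0.85, about 0.353
```

A battery at its minimum level therefore yields a factor of 0, and the factor grows as the battery fills.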

The reward is calculated from the fuzzy system input vector, which is composed of the quantities
obtained from (4), (5), and (6). As shown in Figure 9, three membership functions are used for each
input; quantifying the inputs in three areas provides sufficient detail to cover the range of each
input. P, A, and G denote poor, average, and good. Figure 10 illustrates the rules and their
results, where NVB, NB, NS, Z, PS, PB, and PVB denote negative very big, negative big, negative
small, zero, positive small, positive big, and positive very big, respectively.


Figure 9. Membership functions in input

Figure 10. Membership functions in output


The aim is to ensure that the state of charge (SOC) of the battery and the percentage of energy in
the battery of the hybrid vehicle are maintained at their maximum levels while simultaneously
covering the power consumption. When the SOC and the percentage of energy stored in the vehicle
battery are not at their maximum values, the power demand must be served while increasing both at
the same time. If this is not possible, covering the energy demand should be the goal.
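The four Mamdani blocks (fuzzification, rule evaluation, aggregation, defuzzification) can be sketched in a few lines. Everything specific in this sketch is an assumption made for illustration: the triangular membership functions, their breakpoints, and in particular the rule base, which simply adds one step on the NVB to PVB scale for every improvement from P to A to G. The actual membership functions and rules of the paper are those of Figures 9 and 10.

```python
from itertools import product


def tri(x, left, peak, right):
    """Triangular membership function with support (left, right)."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


# Hypothetical input sets over [-1, 1]: poor, average, good.
INPUT_SETS = {"P": (-2.0, -1.0, 0.0), "A": (-1.0, 0.0, 1.0), "G": (0.0, 1.0, 2.0)}
OUTPUT_LABELS = ("NVB", "NB", "NS", "Z", "PS", "PB", "PVB")
# Hypothetical output sets: seven triangles with peaks evenly spaced on [-1, 1].
OUTPUT_PEAKS = {lab: -1.0 + i / 3 for i, lab in enumerate(OUTPUT_LABELS)}
SCORE = {"P": 0, "A": 1, "G": 2}


def fuzzy_reward(r_soc, r_vehicle, r_load, samples=201):
    # 1. Fuzzification of the three crisp inputs.
    degrees = [{n: tri(x, *abc) for n, abc in INPUT_SETS.items()}
               for x in (r_soc, r_vehicle, r_load)]
    # 2. Rule evaluation: AND is min; rules with the same consequent combine by max.
    strength = dict.fromkeys(OUTPUT_LABELS, 0.0)
    for labels in product("PAG", repeat=3):
        fire = min(d[l] for d, l in zip(degrees, labels))
        out = OUTPUT_LABELS[sum(SCORE[l] for l in labels)]  # assumed rule base
        strength[out] = max(strength[out], fire)
    # 3-4. Clip each output set, aggregate by max, defuzzify by centroid.
    num = den = 0.0
    for k in range(samples):
        y = -1.0 + 2.0 * k / (samples - 1)
        mu = max(min(strength[lab], tri(y, peak - 1 / 3, peak, peak + 1 / 3))
                 for lab, peak in OUTPUT_PEAKS.items())
        num += y * mu
        den += mu
    return num / den if den else 0.0
```

With this assumed rule base, three good inputs produce a strongly positive reward, three poor inputs a strongly negative one, and three average inputs a reward of approximately zero.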
5. RESULTS AND DISCUSSION
5.1. Workings of the system 

As can be seen in Figure 11, the power produced by the photovoltaic panels (Ppv) and by the wind
turbine (Pw) add up to the system power (Psys) in every time period t:

$P_{sys}(t) = P_{pv}(t) + P_w(t)$ (7)

To ensure a longer battery life, the battery should be kept charged between 25% and 85%. The second
parameter is the SOC, which is calculated by (8) at each time interval t (1 hour):

$SOC = P_{bat} / B_{capacity}$ (8)

where $B_{capacity}$ represents the capacity of the battery and $P_{bat}$ represents its power. The
battery state of charge (SOC) is maintained between $SOC_{minimum}$ and $SOC_{maximum}$, with
$SOC_{min}$ = 25% (at its minimum level, the battery cannot be discharged) and $SOC_{max}$ = 85%
(at its maximum level, the battery cannot be charged further). The minimum and maximum levels of
the battery are determined as follows:

$Bat_{minimum} = SOC_{minimum} \times B_{capacity}$ (9)

$Bat_{maximum} = SOC_{maximum} \times B_{capacity}$ (10)

The third parameter is $Bat_{new}$, which indicates the battery level. In each time period t,
$Bat_{new}$ is calculated as follows:

$Bat_{new} = Battery + P_{photovoltaic} + P_{wind} - P_{need}$ (11)

The load needs ($P_{need}$) and the consumption of the hybrid vehicle ($P_{veh}$) over 3 days are
shown in Figure 12.
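Equations (7) to (11) can be combined into a small hourly bookkeeping sketch. The function names are hypothetical, the capacity of 4000 used in the usage example is an illustrative value rather than a parameter of the studied system, and clamping the new level to the interval from Batminimum to Batmaximum is an assumption about how the SOC limits are enforced.

```python
SOC_MIN = 0.25  # lower SOC limit given in the text
SOC_MAX = 0.85  # upper SOC limit given in the text


def system_power(p_pv, p_w):
    """Equation (7): total renewable power."""
    return p_pv + p_w


def soc(battery, capacity):
    """Equation (8): state of charge as a fraction of capacity."""
    return battery / capacity


def battery_limits(capacity):
    """Equations (9) and (10): minimum and maximum battery levels."""
    return SOC_MIN * capacity, SOC_MAX * capacity


def battery_update(battery, p_pv, p_w, p_need, capacity):
    """Equation (11), followed by clamping to the allowed range (an assumption)."""
    bat_min, bat_max = battery_limits(capacity)
    bat_new = battery + p_pv + p_w - p_need
    return max(bat_min, min(bat_max, bat_new))


# One-hour step with a hypothetical capacity of 4000.
level = battery_update(1000, p_pv=300, p_w=200, p_need=100, capacity=4000)
```

In this example the surplus of 400 raises the level from 1000 to 1400; a deficit at the minimum level would leave the battery at 1000 because it cannot be discharged further.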


Figure 11. Standard model of power produced by the Pw/Ppv sources for 3 days (power versus hours;
curves: sum of Pw and sum of Ppv)


5.2. Results and discussion 

The battery was tested starting from its minimum charge level and from its maximum charge level to
determine the effectiveness of the deep Q-learning network. Empty battery: as a starting point, the
battery is at its lowest value (PBattery = Batminimum = 1000 W). The chart shows the battery level
variation over three days. Figure 13 presents the efficiency indicators of our approach using fuzzy
reward deep Q-learning. The other approach does not use fuzzy reward signals, so its immediate
reward is calculated according to (12):

$R_{total} = \frac{R_{soc} + R_{vehicle} + R_{load}}{3}$ (12)


Full battery: the battery starts at its maximum value (PBattery = Batmaximum = 2800 W). Figure 14
shows the battery level variation over three days.

Figure 12. Standard daily load and hybrid vehicle consumption variation for 3 days

Figure 13. Battery power curve obtained when the system starts with an empty battery

Figure 14. Battery power curve obtained when the system starts with a full battery

At any time, the agent can perform one action for the system battery and another action for the
electric vehicle (depending on the state of the wallbox unit). The battery action is either
charging (‘Cg’), when the renewable energy sources provide a surplus of energy and the load is
satisfied, or discharging (‘Dg’), when the energy provided is not enough to satisfy the loads, as
shown in Figure 15. The electric vehicle is charged on demand only if there is an excess of energy
provided by the renewable energy sources or, alternatively, from the system battery after the needs
of the load are satisfied. In other words, the agent is able to take any combination of the two
actions at any time.

Figures 13, 14 and 16 illustrate how the efficiency indicators change over time in each case. The
simulation covers three days. At the start of the simulation, the indicators rise rapidly as more
exploration is performed. With fuzzy rewards, the proposed approach performs around 25% better,
responds more dynamically, and is more efficient than the alternative method, which stabilizes at
lower levels and less quickly. This comparison verifies that the system with fuzzy reward using deep
Q-learning keeps the battery and the wallbox unit optimally charged and less discharged. The
performance of the algorithm has been validated in simulations, and a summary of the results
confirms the economic advantages of the suggested approach over the existing optimization approach
using deep learning without fuzzy logic.

Figure 15. SOC and wallbox unit (vehicle)

Figure 16. Power consumption of wallbox unit (vehicle)


6. CONCLUSION 
By controlling the energy flow in a stand-alone system, this paper demonstrates how deep Q-learning
networks can provide insights into complex energy management problems. The independent learners
approach is used to minimize the state space and to improve the learning mechanism. To take
advantage of the state variables, the modified version of this strategy uses state-specific rewards
and information that is local to each agent. A combined exploration and exploitation algorithm
enables each agent to learn quickly, converge rapidly to a policy, and demonstrate excellent
performance. Future experiments will compare different MAS approaches. In addition, these techniques
will be applied in real-world settings where a stand-alone system contains multiple units. Our
university has been actively supporting research on energy management problems solved by machine
learning.